Credit Card Users Churn Prediction

Description

The Thera bank recently saw a steep decline in the number of users of their credit card, credit cards are a good source of income for banks because of different kinds of fees charged by the banks like annual fees, balance transfer fees, and cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances. Customers’ leaving credit cards services would lead bank to loss, so the bank wants to analyze the data of customers and identify the customers who will leave their credit card services and reason for same – so that bank could improve upon those areas You as a Data scientist at Thera bank need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards You need to identify the best possible model that will give the required performance

Objective

1)Explore and visualize the dataset. 2)Build a classification model to predict if the customer is going to churn or not 3)Optimize the model using appropriate techniques 4)Generate a set of insights and recommendations that will help the bank

Questions

Questions: 1) What is the best possible model to predict a person who would churn or not? 2) What are the most significant variables at the model?

Data Dictionar

CLIENTNUM: Client number. Unique identifier for the customer holding the account Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer" Customer_Age: Age in Years Gender: Gender of the account holder Dependent_count: Number of dependents Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to a college student), Post-Graduate, Doctorate. Marital_Status: Marital Status of the account holder Income_Category: Annual Income Category of the account holder Card_Category: Type of Card Months_on_book: Period of relationship with the bank Total_Relationship_Count: Total no. of products held by the customer Months_Inactive_12_mon: No. of months inactive in the last 12 months Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months Credit_Limit: Credit Limit on the Credit Card Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months) Total_Trans_Amt: Total Transaction Amount (Last 12 months) Total_Trans_Ct: Total Transaction Count (Last 12 months) Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter Avg_Utilization_Ratio: Represents how much of the available credit the customer spent
In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix
)

from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", lambda x: "%.3f" % x)
In [2]:
data= pd.read_csv("BankChurners.csv")
In [3]:
data.head()
Out[3]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 768805383 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.000 777 11914.000 1.335 1144 42 1.625 0.061
1 818770008 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.000 864 7392.000 1.541 1291 33 3.714 0.105
2 713982108 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.000 0 3418.000 2.594 1887 20 2.333 0.000
3 769911858 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 3 4 1 3313.000 2517 796.000 1.405 1171 20 2.333 0.760
4 709106358 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.000 0 4716.000 2.175 816 28 2.500 0.000
In [4]:
df=data.copy()
In [5]:
print(f"there are {df.shape[0]} rows and {df.shape[1]} columns")
there are 10127 rows and 21 columns
In [6]:
df.isnull().sum() #there are missing values on the Education_Level and Marital_Status columns
Out[6]:
CLIENTNUM                      0
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64
In [7]:
df.duplicated().sum() #there is no duplicated rows on the data set
Out[7]:
0
In [8]:
df.dtypes
Out[8]:
CLIENTNUM                     int64
Attrition_Flag               object
Customer_Age                  int64
Gender                       object
Dependent_count               int64
Education_Level              object
Marital_Status               object
Income_Category              object
Card_Category                object
Months_on_book                int64
Total_Relationship_Count      int64
Months_Inactive_12_mon        int64
Contacts_Count_12_mon         int64
Credit_Limit                float64
Total_Revolving_Bal           int64
Avg_Open_To_Buy             float64
Total_Amt_Chng_Q4_Q1        float64
Total_Trans_Amt               int64
Total_Trans_Ct                int64
Total_Ct_Chng_Q4_Q1         float64
Avg_Utilization_Ratio       float64
dtype: object
In [9]:
df.describe(include="all").T
Out[9]:
count unique top freq mean std min 25% 50% 75% max
CLIENTNUM 10127.000 NaN NaN NaN 739177606.334 36903783.450 708082083.000 713036770.500 717926358.000 773143533.000 828343083.000
Attrition_Flag 10127 2 Existing Customer 8500 NaN NaN NaN NaN NaN NaN NaN
Customer_Age 10127.000 NaN NaN NaN 46.326 8.017 26.000 41.000 46.000 52.000 73.000
Gender 10127 2 F 5358 NaN NaN NaN NaN NaN NaN NaN
Dependent_count 10127.000 NaN NaN NaN 2.346 1.299 0.000 1.000 2.000 3.000 5.000
Education_Level 8608 6 Graduate 3128 NaN NaN NaN NaN NaN NaN NaN
Marital_Status 9378 3 Married 4687 NaN NaN NaN NaN NaN NaN NaN
Income_Category 10127 6 Less than $40K 3561 NaN NaN NaN NaN NaN NaN NaN
Card_Category 10127 4 Blue 9436 NaN NaN NaN NaN NaN NaN NaN
Months_on_book 10127.000 NaN NaN NaN 35.928 7.986 13.000 31.000 36.000 40.000 56.000
Total_Relationship_Count 10127.000 NaN NaN NaN 3.813 1.554 1.000 3.000 4.000 5.000 6.000
Months_Inactive_12_mon 10127.000 NaN NaN NaN 2.341 1.011 0.000 2.000 2.000 3.000 6.000
Contacts_Count_12_mon 10127.000 NaN NaN NaN 2.455 1.106 0.000 2.000 2.000 3.000 6.000
Credit_Limit 10127.000 NaN NaN NaN 8631.954 9088.777 1438.300 2555.000 4549.000 11067.500 34516.000
Total_Revolving_Bal 10127.000 NaN NaN NaN 1162.814 814.987 0.000 359.000 1276.000 1784.000 2517.000
Avg_Open_To_Buy 10127.000 NaN NaN NaN 7469.140 9090.685 3.000 1324.500 3474.000 9859.000 34516.000
Total_Amt_Chng_Q4_Q1 10127.000 NaN NaN NaN 0.760 0.219 0.000 0.631 0.736 0.859 3.397
Total_Trans_Amt 10127.000 NaN NaN NaN 4404.086 3397.129 510.000 2155.500 3899.000 4741.000 18484.000
Total_Trans_Ct 10127.000 NaN NaN NaN 64.859 23.473 10.000 45.000 67.000 81.000 139.000
Total_Ct_Chng_Q4_Q1 10127.000 NaN NaN NaN 0.712 0.238 0.000 0.582 0.702 0.818 3.714
Avg_Utilization_Ratio 10127.000 NaN NaN NaN 0.275 0.276 0.000 0.023 0.176 0.503 0.999

Observation

1) There are 10127 rows and 21 columns in the data set 2) There are misssing values on the Education_Level and Marital_Status columns. 3) There is no duplicated rows. 4) Some features should be converted to categorical. 5) Some columns data types are numerical but they should be categorical.

Data Prosessing

In [10]:
df=df.drop(["CLIENTNUM"], axis=1)
In [11]:
list_str = list(df.select_dtypes(include=['object']).columns)
list_str
Out[11]:
['Attrition_Flag',
 'Gender',
 'Education_Level',
 'Marital_Status',
 'Income_Category',
 'Card_Category']
In [12]:
cat_list=["Attrition_Flag","Gender","Education_Level","Marital_Status","Income_Category","Card_Category","Dependent_count","Total_Relationship_Count","Months_Inactive_12_mon","Contacts_Count_12_mon"]
In [13]:
for column in cat_list:
    df[column]=df[column].astype("category")
In [14]:
df.dtypes # recheck the data types
Out[14]:
Attrition_Flag              category
Customer_Age                   int64
Gender                      category
Dependent_count             category
Education_Level             category
Marital_Status              category
Income_Category             category
Card_Category               category
Months_on_book                 int64
Total_Relationship_Count    category
Months_Inactive_12_mon      category
Contacts_Count_12_mon       category
Credit_Limit                 float64
Total_Revolving_Bal            int64
Avg_Open_To_Buy              float64
Total_Amt_Chng_Q4_Q1         float64
Total_Trans_Amt                int64
Total_Trans_Ct                 int64
Total_Ct_Chng_Q4_Q1          float64
Avg_Utilization_Ratio        float64
dtype: object
In [15]:
for column in cat_list:
    print(df[column].value_counts())
    print("#"*50)
Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64
##################################################
F    5358
M    4769
Name: Gender, dtype: int64
##################################################
Graduate         3128
High School      2013
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: Education_Level, dtype: int64
##################################################
Married     4687
Single      3943
Divorced     748
Name: Marital_Status, dtype: int64
##################################################
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: Income_Category, dtype: int64
##################################################
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: Card_Category, dtype: int64
##################################################
3    2732
2    2655
1    1838
4    1574
0     904
5     424
Name: Dependent_count, dtype: int64
##################################################
3    2305
4    1912
5    1891
6    1866
2    1243
1     910
Name: Total_Relationship_Count, dtype: int64
##################################################
3    3846
2    3282
1    2233
4     435
5     178
6     124
0      29
Name: Months_Inactive_12_mon, dtype: int64
##################################################
3    3380
2    3227
1    1499
4    1392
0     399
5     176
6      54
Name: Contacts_Count_12_mon, dtype: int64
##################################################

Observation

1)Attrition_flag is the target variable, and it is unbalanced; the existing customer is 8500, the attrited customer is 1627, data can be oversampling or/and undersampling 2) The number of female customers is more than the number of male customers. 3)Education_Level has 6, and Marital_Status has 3 categories. 4)Income_Category has a category that named abc is misnamed. This value should be code as NaN. 5) Blue card category is higher than the other categories. 6)Dependent_count and Total_Relationship_Count have 6 six categories, Months_Inactive_12_mon and Contacts_Count_12_mon have 7 categories.

EDA

Univariate Analysis¶

In [16]:
def labeled_barplot(data, feature, percentage=False, n=None):
    total = len(data[feature])  
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 6))
    else:
        plt.figure(figsize=(n + 1, 6))

    plt.xticks(rotation=45, fontsize=10)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Set2",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if percentage == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            ) 
        else:
            label = p.get_height() 

        x = p.get_x() + p.get_width() / 2  
        y = p.get_height() 

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        ) 

    plt.show()  
In [17]:
for feature in cat_list:
    print(labeled_barplot(df, feature, percentage=True))
    print("#"*80)
None
################################################################################
None
################################################################################
None
################################################################################
None
################################################################################
None
################################################################################
None
################################################################################
None
################################################################################
None
################################################################################
None
################################################################################
None
################################################################################
In [18]:
def boxplot_hitogram(data, feature, figsize=(10, 5), kde=True, bins=20):
    
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2, 
        sharex=True,  
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="yellowgreen"
    ) 
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="pink"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  
    ax_hist2.axvline(
        data[feature].mean(), color="lime", linestyle="-"
    )  
    ax_hist2.axvline(
        data[feature].median(), color="red", linestyle="--"
    ) 
In [19]:
num_list= list(df.select_dtypes(include=['float64',"int64"]).columns)
num_list
Out[19]:
['Customer_Age',
 'Months_on_book',
 'Credit_Limit',
 'Total_Revolving_Bal',
 'Avg_Open_To_Buy',
 'Total_Amt_Chng_Q4_Q1',
 'Total_Trans_Amt',
 'Total_Trans_Ct',
 'Total_Ct_Chng_Q4_Q1',
 'Avg_Utilization_Ratio']
In [20]:
df.describe().T
Out[20]:
count mean std min 25% 50% 75% max
Customer_Age 10127.000 46.326 8.017 26.000 41.000 46.000 52.000 73.000
Months_on_book 10127.000 35.928 7.986 13.000 31.000 36.000 40.000 56.000
Credit_Limit 10127.000 8631.954 9088.777 1438.300 2555.000 4549.000 11067.500 34516.000
Total_Revolving_Bal 10127.000 1162.814 814.987 0.000 359.000 1276.000 1784.000 2517.000
Avg_Open_To_Buy 10127.000 7469.140 9090.685 3.000 1324.500 3474.000 9859.000 34516.000
Total_Amt_Chng_Q4_Q1 10127.000 0.760 0.219 0.000 0.631 0.736 0.859 3.397
Total_Trans_Amt 10127.000 4404.086 3397.129 510.000 2155.500 3899.000 4741.000 18484.000
Total_Trans_Ct 10127.000 64.859 23.473 10.000 45.000 67.000 81.000 139.000
Total_Ct_Chng_Q4_Q1 10127.000 0.712 0.238 0.000 0.582 0.702 0.818 3.714
Avg_Utilization_Ratio 10127.000 0.275 0.276 0.000 0.023 0.176 0.503 0.999
In [21]:
boxplot_hitogram(df, "Customer_Age", figsize=(10, 5), kde=True, bins=50)

Customer ages ranged from 26 to 73, and the mean value is 46.3 and normally distributed. Mean and median scores are almost the same. There are outliers.

In [22]:
boxplot_hitogram(df, "Months_on_book", figsize=(10, 5), kde=True, bins=20)

The mean and median scores of the Months_on_book variable are the same. There are outliers.

In [23]:
boxplot_hitogram(df, "Credit_Limit", figsize=(10, 5), kde=True, bins=50)

Credit_Limit is right skewed. There are outliers.Mean score is 8631.9

In [24]:
boxplot_hitogram(df, "Total_Revolving_Bal", figsize=(10, 5), kde=True, bins=50)

Total_Revolving_Bal is right skewed. Box plot shows that there is no outliers.

In [25]:
boxplot_hitogram(df, "Avg_Open_To_Buy", figsize=(10, 5), kde=True, bins=50)

Avg_Open_To_Buy is skewed right and there are outliers.

In [26]:
boxplot_hitogram(df, "Total_Amt_Chng_Q4_Q1", figsize=(10, 5), kde=True, bins=50)

The mean of the Total_Amt_Chng_Q4_Q1 variable is 7469.1 and the mean and median scores are almost same. There are many outliers.

In [27]:
boxplot_hitogram(df, "Total_Trans_Ct", figsize=(10, 5), kde=True, bins=50)

Total_Trans_Ct is not normally distributed; there are two peaks. There are outliers.

In [28]:
boxplot_hitogram(df, "Total_Trans_Amt", figsize=(10, 5), kde=True, bins=50)

Total_Trans_Amt is not normally distributed; there are peaks; It can be categorical. There are outliers.

In [29]:
boxplot_hitogram(df, "Total_Ct_Chng_Q4_Q1", figsize=(10, 5), kde=True, bins=50)

Total_Ct_Chng_Q4_Q1 has outliers. It ranged from 0 to 3.714 and has 0.274 mean value.

In [30]:
boxplot_hitogram(df, "Avg_Utilization_Ratio", figsize=(10, 5), kde=True, bins=50)

Avg_Utilization_Ratio is skwed right and has 0.274 mean value and ranged from 0 to 0.999.

Bivariate Analysis

In [31]:
sns.pairplot(df, hue="Attrition_Flag")
Out[31]:
<seaborn.axisgrid.PairGrid at 0x1fb2e24e640>
In [32]:
plt.figure(figsize=(15,8)) #There is a very high correlation between  credit_Limit and Avg_Open_To_Buy, so one of them should be dropped.
sns.heatmap(df.corr(), annot=True)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb31300340>
In [33]:
def stc_barplot(data, predictor, target):
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("*" * 100)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(
        loc="lower left",
        frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
In [34]:
stc_barplot(df, "Education_Level", "Attrition_Flag")
Attrition_Flag   Attrited Customer  Existing Customer   All
Education_Level                                            
All                           1371               7237  8608
Graduate                       487               2641  3128
High School                    306               1707  2013
Uneducated                     237               1250  1487
College                        154                859  1013
Doctorate                       95                356   451
Post-Graduate                   92                424   516
****************************************************************************************************
In [35]:
stc_barplot(df, "Gender", "Attrition_Flag")
Attrition_Flag  Attrited Customer  Existing Customer    All
Gender                                                     
All                          1627               8500  10127
F                             930               4428   5358
M                             697               4072   4769
****************************************************************************************************
In [36]:
stc_barplot(df, "Marital_Status", "Attrition_Flag")
Attrition_Flag  Attrited Customer  Existing Customer   All
Marital_Status                                            
All                          1498               7880  9378
Married                       709               3978  4687
Single                        668               3275  3943
Divorced                      121                627   748
****************************************************************************************************
In [37]:
stc_barplot(df, "Income_Category", "Attrition_Flag")
Attrition_Flag   Attrited Customer  Existing Customer    All
Income_Category                                             
All                           1627               8500  10127
Less than $40K                 612               2949   3561
$40K - $60K                    271               1519   1790
$80K - $120K                   242               1293   1535
$60K - $80K                    189               1213   1402
abc                            187                925   1112
$120K +                        126                601    727
****************************************************************************************************
In [38]:
stc_barplot(df, "Card_Category", "Attrition_Flag")
Attrition_Flag  Attrited Customer  Existing Customer    All
Card_Category                                              
All                          1627               8500  10127
Blue                         1519               7917   9436
Silver                         82                473    555
Gold                           21                 95    116
Platinum                        5                 15     20
****************************************************************************************************
In [39]:
stc_barplot(df, "Dependent_count", "Attrition_Flag")
Attrition_Flag   Attrited Customer  Existing Customer    All
Dependent_count                                             
All                           1627               8500  10127
3                              482               2250   2732
2                              417               2238   2655
1                              269               1569   1838
4                              260               1314   1574
0                              135                769    904
5                               64                360    424
****************************************************************************************************
In [40]:
stc_barplot(df, "Total_Relationship_Count", "Attrition_Flag")
Attrition_Flag            Attrited Customer  Existing Customer    All
Total_Relationship_Count                                             
All                                    1627               8500  10127
3                                       400               1905   2305
2                                       346                897   1243
1                                       233                677    910
5                                       227               1664   1891
4                                       225               1687   1912
6                                       196               1670   1866
****************************************************************************************************
In [41]:
stc_barplot(df, "Months_Inactive_12_mon", "Attrition_Flag")
Attrition_Flag          Attrited Customer  Existing Customer    All
Months_Inactive_12_mon                                             
All                                  1627               8500  10127
3                                     826               3020   3846
2                                     505               2777   3282
4                                     130                305    435
1                                     100               2133   2233
5                                      32                146    178
6                                      19                105    124
0                                      15                 14     29
****************************************************************************************************
In [42]:
stc_barplot(df, "Contacts_Count_12_mon", "Attrition_Flag")
Attrition_Flag         Attrited Customer  Existing Customer    All
Contacts_Count_12_mon                                             
All                                 1627               8500  10127
3                                    681               2699   3380
2                                    403               2824   3227
4                                    315               1077   1392
1                                    108               1391   1499
5                                     59                117    176
6                                     54                  0     54
0                                      7                392    399
****************************************************************************************************
Observation
For the Total_Raletionship_count feature, attrited customers are higher on the 1 and 2 values than the other values. For the Months_Inactive_12_mon feature, attrited customers are higher on the 0 values than the other values. For the Contacts_Count_12_mon feature, Customers who have six times contacts with the bank attrited. There is a very high correlation between credit_Limit and Avg_Open_To_Buy, so one of them should be dropped.
In [43]:
plt.figure(figsize=(10,5))
sns.boxplot(x="Attrition_Flag", y="Customer_Age", data=df, orient="vertical")
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb38ee07c0>
In [44]:
plt.figure(figsize=(10,5))
sns.boxplot(x="Attrition_Flag", y="Months_on_book", data=df, orient="vertical")
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb38f60400>
In [45]:
plt.figure(figsize=(10,5))
sns.boxplot(x="Attrition_Flag", y="Credit_Limit", data=df, orient="vertical")
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb38ee0640>
In [46]:
plt.figure(figsize=(10,5))
sns.boxplot(x="Attrition_Flag", y="Total_Revolving_Bal", data=df, orient="vertical")
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb38f4ff40>
In [47]:
plt.figure(figsize=(10,5))
sns.boxplot(x="Attrition_Flag", y="Avg_Open_To_Buy", data=df, orient="vertical")
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb39097fa0>
In [48]:
plt.figure(figsize=(10,5))
sns.boxplot(x="Attrition_Flag", y="Total_Amt_Chng_Q4_Q1", data=df, orient="vertical")
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb39097880>
In [49]:
plt.figure(figsize=(10,5))
sns.boxplot(x="Attrition_Flag", y="Total_Trans_Amt", data=df, orient="vertical")
Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb3a1314c0>
In [50]:
plt.figure(figsize=(10,5))
sns.boxplot(x="Attrition_Flag", y="Total_Trans_Ct", data=df, orient="vertical")
Out[50]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb39097c10>
In [51]:
plt.figure(figsize=(10,5))
sns.boxplot(x="Attrition_Flag", y="Total_Ct_Chng_Q4_Q1", data=df, orient="vertical")
Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb3a1fe790>
In [52]:
plt.figure(figsize=(10,5))
sns.boxplot(x="Attrition_Flag", y="Avg_Utilization_Ratio", data=df, orient="vertical")
Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb3a1b03a0>
Observation

The ages of the attrited and existing customers have similar median scores. Attrited and existing customers' median scores of months on book are similar. Existing customers' total revolving balance median score is higher than the attrited customers' total revolving balance median score.

Missing Values

In [53]:
df.isnull().sum()
Out[53]:
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64
In [54]:
df[df["Education_Level"].isnull()]
Out[54]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
6 Existing Customer 51 M 4 NaN Married $120K + Gold 46 6 1 3 34516.000 2264 32252.000 1.975 1330 31 0.722 0.066
11 Existing Customer 65 M 1 NaN Married $40K - $60K Blue 54 6 2 3 9095.000 1587 7508.000 1.433 1314 26 1.364 0.174
15 Existing Customer 44 M 4 NaN NaN $80K - $120K Blue 37 5 1 2 4234.000 972 3262.000 1.707 1348 27 1.700 0.230
17 Existing Customer 41 M 3 NaN Married $80K - $120K Blue 34 4 4 1 13535.000 1291 12244.000 0.653 1028 21 1.625 0.095
23 Existing Customer 47 F 4 NaN Single Less than $40K Blue 36 3 3 2 2492.000 1560 932.000 0.573 1126 23 0.353 0.626
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10090 Existing Customer 36 F 3 NaN Married $40K - $60K Blue 22 5 3 3 12958.000 2273 10685.000 0.608 15681 96 0.627 0.175
10094 Existing Customer 59 M 1 NaN Single $60K - $80K Blue 48 3 1 2 7288.000 0 7288.000 0.640 14873 120 0.714 0.000
10095 Existing Customer 46 M 3 NaN Married $80K - $120K Blue 33 4 1 3 34516.000 1099 33417.000 0.816 15490 110 0.618 0.032
10118 Attrited Customer 50 M 1 NaN NaN $80K - $120K Blue 36 6 3 4 9959.000 952 9007.000 0.825 10310 63 1.100 0.096
10123 Attrited Customer 41 M 2 NaN Divorced $40K - $60K Blue 25 4 2 3 4277.000 2186 2091.000 0.804 8764 69 0.683 0.511

1519 rows × 20 columns

There is no trend between missing values and other variables, so mode will be used to fill the missing values.
In [55]:
df[df["Income_Category"]=="abc"]
Out[55]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
19 Existing Customer 45 F 2 Graduate Married abc Blue 37 6 1 2 14470.000 1157 13313.000 0.966 1207 21 0.909 0.080
28 Existing Customer 44 F 3 Uneducated Single abc Blue 34 5 2 2 10100.000 0 10100.000 0.525 1052 18 1.571 0.000
39 Attrited Customer 66 F 0 Doctorate Married abc Blue 56 5 4 3 7882.000 605 7277.000 1.052 704 16 0.143 0.077
44 Existing Customer 38 F 4 Graduate Single abc Blue 28 2 3 3 9830.000 2055 7775.000 0.977 1042 23 0.917 0.209
58 Existing Customer 44 F 5 Graduate Married abc Blue 35 4 1 2 6273.000 978 5295.000 2.275 1359 25 1.083 0.156
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10021 Attrited Customer 30 F 1 Graduate Married abc Blue 18 4 1 4 4377.000 2517 1860.000 0.941 8759 74 0.609 0.575
10040 Attrited Customer 50 F 3 Doctorate Single abc Blue 36 4 3 3 5173.000 0 5173.000 0.912 8757 68 0.789 0.000
10083 Existing Customer 42 F 4 Uneducated Married abc Blue 23 4 1 2 8348.000 0 8348.000 0.695 15905 111 0.708 0.000
10092 Attrited Customer 40 F 3 Graduate Married abc Blue 25 1 2 3 6888.000 1878 5010.000 1.059 9038 64 0.829 0.273
10119 Attrited Customer 55 F 3 Uneducated Single abc Blue 47 4 3 3 14657.000 2517 12140.000 0.166 6009 53 0.514 0.172

1112 rows × 20 columns

In [56]:
df.groupby(by=["Income_Category"])["Gender"].value_counts()
Out[56]:
Income_Category  Gender
$120K +          M          727
$40K - $60K      F         1014
                 M          776
$60K - $80K      M         1402
$80K - $120K     M         1535
Less than $40K   F         3284
                 M          277
abc              F         1060
                 M           52
Name: Gender, dtype: int64
abc values are missing values; 1060 are female, and 52 are male. There is no gap between "Less than $40K" and "$120K", so these missing values should be one of the current values. Female is mostly "less than $40" or "$40K - $60K," so we can fill the missing values with mode (less than $40) because many females have this value. Before filling the missing values with mode, abc values should be replaced with NaN.
In [57]:
df["Income_Category"].replace({"abc":np.nan}, inplace=True)
In [58]:
df["Income_Category"].isnull().sum()
Out[58]:
1112
In [59]:
df[df["Marital_Status"].isnull()]
Out[59]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
3 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 3 4 1 3313.000 2517 796.000 1.405 1171 20 2.333 0.760
7 Existing Customer 32 M 0 High School NaN $60K - $80K Silver 27 2 2 2 29081.000 1396 27685.000 2.204 1538 36 0.714 0.048
10 Existing Customer 42 M 5 Uneducated NaN $120K + Blue 31 5 3 2 6748.000 1467 5281.000 0.831 1201 42 0.680 0.217
13 Existing Customer 35 M 3 Graduate NaN $60K - $80K Blue 30 5 1 3 8547.000 1666 6881.000 1.163 1311 33 2.000 0.195
15 Existing Customer 44 M 4 NaN NaN $80K - $120K Blue 37 5 1 2 4234.000 972 3262.000 1.707 1348 27 1.700 0.230
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10070 Existing Customer 47 M 3 High School NaN $80K - $120K Silver 40 5 3 2 34516.000 1371 33145.000 0.691 15930 123 0.836 0.040
10100 Existing Customer 39 M 2 Graduate NaN $60K - $80K Silver 36 4 2 2 29808.000 0 29808.000 0.669 16098 128 0.684 0.000
10101 Existing Customer 42 M 2 Graduate NaN $40K - $60K Blue 30 3 2 5 3735.000 1723 2012.000 0.595 14501 92 0.840 0.461
10118 Attrited Customer 50 M 1 NaN NaN $80K - $120K Blue 36 6 3 4 9959.000 952 9007.000 0.825 10310 63 1.100 0.096
10125 Attrited Customer 30 M 2 Graduate NaN $40K - $60K Blue 36 4 3 3 5281.000 0 5281.000 0.535 8395 62 0.722 0.000

749 rows × 20 columns

In [60]:
df["Marital_Status"].value_counts()
Out[60]:
Married     4687
Single      3943
Divorced     748
Name: Marital_Status, dtype: int64

There is no trend between the missing values and the other features so the missing values will be filled with mode on the Marital_Status column.

Transformation

Some features are skewed and will be better behave on the arcsinh. np.arcsinh will be used for the transformation because some features are 0 values.

In [61]:
df1=df.copy()
In [62]:
df1.drop(columns=["Avg_Open_To_Buy"], inplace=True) #drop the Avg_Open_To_Buy because it has a very high correlation with the Credit_Limit variable
In [63]:
df1["Credit_Limit" + '_arc'] = np.arcsinh(df1["Credit_Limit"])
df1.drop("Credit_Limit", axis=1, inplace=True)
In [64]:
sns.distplot(df1["Credit_Limit_arc"])
C:\Users\kayaf\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb3a2e9f40>
In [65]:
df1["Total_Amt_Chng_Q4_Q1" + '_arc'] = np.arcsinh(df1["Total_Amt_Chng_Q4_Q1"])
df1.drop("Total_Amt_Chng_Q4_Q1", axis=1, inplace=True)
In [66]:
sns.distplot(df1["Total_Amt_Chng_Q4_Q1_arc"])
C:\Users\kayaf\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb38ee0b80>
In [67]:
df1["Total_Ct_Chng_Q4_Q1" + '_arc'] = np.arcsinh(df["Total_Ct_Chng_Q4_Q1"])
df1.drop("Total_Ct_Chng_Q4_Q1", axis=1, inplace=True)
In [68]:
sns.distplot(df1["Total_Ct_Chng_Q4_Q1_arc"])
C:\Users\kayaf\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
Out[68]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb3a1a38b0>
In [69]:
df1["Avg_Utilization_Ratio" + '_arc'] = np.arcsinh(df["Avg_Utilization_Ratio"])
df1.drop("Avg_Utilization_Ratio", axis=1, inplace=True)
In [70]:
sns.distplot(df1["Avg_Utilization_Ratio_arc"])
C:\Users\kayaf\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb3a264a30>

Binning

Let's bin some numerical features which have peaks.

In [71]:
bins = [-np.inf, 50, 100, np.inf]
labels = ['<50', '50-100', '>100']

df1['Total_Trans_Ct_bins'] = pd.cut(df1['Total_Trans_Ct'], bins=bins, labels=labels, include_lowest=True)
In [72]:
sns.countplot(df1['Total_Trans_Ct_bins'])
C:\Users\kayaf\anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(
Out[72]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb38b43760>
In [73]:
df1['Total_Trans_Ct_bins'].value_counts()
Out[73]:
50-100    6350
<50       3128
>100       649
Name: Total_Trans_Ct_bins, dtype: int64
In [74]:
df1.drop(columns=["Total_Trans_Ct"], inplace=True)
In [75]:
bins = [-np.inf, 1000, 2000, np.inf]
labels = ['<1000', '1000-2000', '>2000']

df1['Total_Revolving_Bal_bins'] = pd.cut(df1['Total_Revolving_Bal'], bins=bins, labels=labels, include_lowest=True)
df1.drop(columns=["Total_Revolving_Bal"], inplace=True)
In [76]:
sns.countplot(df1['Total_Revolving_Bal_bins'])
Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb38e84730>
In [77]:
df1['Total_Revolving_Bal_bins'].value_counts()
Out[77]:
1000-2000    4549
<1000        3913
>2000        1665
Name: Total_Revolving_Bal_bins, dtype: int64
In [78]:
bins = [-np.inf, 3000,6000,9000,12000,15000, 18000, np.inf]
labels = ['<3000', '3000-6000','6000-9000','9000-12000','12000-15000','15000-18000', '>18000']

df1['Total_Trans_Amt_bins'] = pd.cut(df1['Total_Trans_Amt'], bins=bins, labels=labels, include_lowest=True)
df1.drop(columns=["Total_Trans_Amt"], inplace=True)
In [79]:
plt.figure(figsize=(10,5))
sns.countplot(df1['Total_Trans_Amt_bins'])
Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb377b8520>
In [80]:
df1.dtypes
Out[80]:
Attrition_Flag               category
Customer_Age                    int64
Gender                       category
Dependent_count              category
Education_Level              category
Marital_Status               category
Income_Category                object
Card_Category                category
Months_on_book                  int64
Total_Relationship_Count     category
Months_Inactive_12_mon       category
Contacts_Count_12_mon        category
Credit_Limit_arc              float64
Total_Amt_Chng_Q4_Q1_arc      float64
Total_Ct_Chng_Q4_Q1_arc       float64
Avg_Utilization_Ratio_arc     float64
Total_Trans_Ct_bins          category
Total_Revolving_Bal_bins     category
Total_Trans_Amt_bins         category
dtype: object
In [81]:
df1["Income_Category"]=df1["Income_Category"].astype("category") #lets change the data type of Income_Category
In [82]:
df1["Attrition_Flag"].replace({"Existing Customer":0, "Attrited Customer":1}, inplace=True) # code Attrited Customer value as 1 and the other as 0
In [83]:
df1["Attrition_Flag"].value_counts() #recheck the values
Out[83]:
0    8500
1    1627
Name: Attrition_Flag, dtype: int64
In [84]:
df1.head()
Out[84]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit_arc Total_Amt_Chng_Q4_Q1_arc Total_Ct_Chng_Q4_Q1_arc Avg_Utilization_Ratio_arc Total_Trans_Ct_bins Total_Revolving_Bal_bins Total_Trans_Amt_bins
0 0 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 10.142 1.100 1.262 0.061 <50 <1000 <3000
1 0 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 9.712 1.217 2.023 0.105 <50 <1000 <3000
2 0 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 8.830 1.682 1.583 0.000 <50 <1000 <3000
3 0 40 F 4 High School NaN Less than $40K Blue 34 3 4 1 8.799 1.141 1.583 0.701 <50 >2000 <3000
4 0 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 9.152 1.519 1.647 0.000 <50 <1000 <3000

Outliers Treatment

In [85]:
quartiles = np.quantile(df1['Customer_Age'][df1['Customer_Age'].notnull()], [.25, .75])
Customer_Age_4iqr = 4 * (quartiles[1] - quartiles[0])
outliers_Customer_Age = df1.loc[np.abs(df1['Customer_Age'] - df1['Customer_Age'].median()) > Customer_Age_4iqr, 'Customer_Age']
outliers_Customer_Age #There are no outliers
Out[85]:
Series([], Name: Customer_Age, dtype: int64)
In [86]:
quartiles = np.quantile(df1['Credit_Limit_arc'][df1['Credit_Limit_arc'].notnull()], [.25, .75])
Credit_Limit_arc_4iqr = 4 * (quartiles[1] - quartiles[0])
outliers_Credit_Limit_arc = df1.loc[np.abs(df1['Credit_Limit_arc'] - df1['Credit_Limit_arc'].median()) > Credit_Limit_arc_4iqr, 'Credit_Limit_arc']
outliers_Credit_Limit_arc  #There are no outliers
Out[86]:
Series([], Name: Credit_Limit_arc, dtype: float64)
In [87]:
quartiles = np.quantile(df1['Total_Ct_Chng_Q4_Q1_arc'][df1['Total_Ct_Chng_Q4_Q1_arc'].notnull()], [.25, .75])
Total_Ct_Chng_Q4_Q1_arc_4iqr = 4 * (quartiles[1] - quartiles[0])
outliers_Total_Ct_Chng_Q4_Q1_arc = df1.loc[np.abs(df1['Total_Ct_Chng_Q4_Q1_arc'] - df1['Total_Ct_Chng_Q4_Q1_arc'].median()) > Total_Ct_Chng_Q4_Q1_arc_4iqr, 'Total_Ct_Chng_Q4_Q1_arc']
outliers_Total_Ct_Chng_Q4_Q1_arc #There are outliers, drop the outliers
Out[87]:
1      2.023
2      1.583
3      1.583
4      1.647
12     1.895
13     1.444
30     1.673
68     1.609
69     1.444
84     1.444
91     1.522
113    1.818
131    1.530
146    1.778
158    1.621
162    1.516
167    1.565
190    1.818
231    1.444
239    1.559
269    1.966
280    1.609
294    1.480
300    1.444
309    1.487
366    1.736
456    1.444
757    1.539
773    1.985
805    1.647
1095   1.539
1256   1.444
1455   1.444
2510   1.647
Name: Total_Ct_Chng_Q4_Q1_arc, dtype: float64
In [88]:
df1.drop(outliers_Total_Ct_Chng_Q4_Q1_arc.index, axis=0, inplace=True)
In [89]:
quartiles = np.quantile(df1['Total_Amt_Chng_Q4_Q1_arc'][df1['Total_Amt_Chng_Q4_Q1_arc'].notnull()], [.25, .75])
Total_Amt_Chng_Q4_Q1_arc_4iqr = 4 * (quartiles[1] - quartiles[0])
outliers_Total_Amt_Chng_Q4_Q1_arc  = df1.loc[np.abs(df1['Total_Amt_Chng_Q4_Q1_arc'] - df1['Total_Amt_Chng_Q4_Q1_arc'].median()) > Total_Amt_Chng_Q4_Q1_arc_4iqr, 'Total_Amt_Chng_Q4_Q1_arc']
outliers_Total_Amt_Chng_Q4_Q1_arc  #There are outliers, drop the outliers
Out[89]:
6      1.432
7      1.531
8      1.925
46     1.577
47     1.593
58     1.560
142    1.442
154    1.496
177    1.467
219    1.597
284    1.507
431    1.454
466    1.559
658    1.563
841    1.521
1085   1.462
1219   1.489
1873   1.460
Name: Total_Amt_Chng_Q4_Q1_arc, dtype: float64
In [90]:
df1.drop(outliers_Total_Amt_Chng_Q4_Q1_arc.index, axis=0, inplace=True)
In [91]:
quartiles = np.quantile(df1['Avg_Utilization_Ratio_arc'][df1['Avg_Utilization_Ratio_arc'].notnull()], [.25, .75])
Avg_Utilization_Ratio_arc_4iqr = 4 * (quartiles[1] - quartiles[0])
outliers_Avg_Utilization_Ratio_arc  = df1.loc[np.abs(df1['Avg_Utilization_Ratio_arc'] - df1['Avg_Utilization_Ratio_arc'].median()) > Avg_Utilization_Ratio_arc_4iqr, 'Avg_Utilization_Ratio_arc']
outliers_Avg_Utilization_Ratio_arc #There is no outlier
Out[91]:
Series([], Name: Avg_Utilization_Ratio_arc, dtype: float64)
In [92]:
quartiles = np.quantile(df1['Months_on_book'][df1['Months_on_book'].notnull()], [.25, .75])
Months_on_book_4iqr = 4 * (quartiles[1] - quartiles[0])
outliers_Months_on_book  = df1.loc[np.abs(df1['Months_on_book'] - df1['Months_on_book'].median()) > Months_on_book_4iqr, 'Months_on_book']
outliers_Months_on_book  #There is no outlier
Out[92]:
Series([], Name: Months_on_book, dtype: int64)

Data Preparation for Modeling

Split data

In [93]:
df2=df1.copy()
In [94]:
X = df2.drop(["Attrition_Flag"], axis=1)
y = df2["Attrition_Flag"]
In [95]:
X_temp,X_test,y_temp,y_test=train_test_split(X,y, test_size=0.2, random_state=1, stratify=y) #split the data first temp and test and then split the temp data to train and val sets
X_train, X_val,y_train, y_val=train_test_split(X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp)
In [96]:
print(X_train.shape, X_val.shape, X_test.shape)
(6045, 18) (2015, 18) (2015, 18)

Missing-Value Treatment

In [97]:
imp_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
cols_to_impute = ["Education_Level", "Marital_Status","Income_Category"]
In [98]:
# fit and transform the imputer on the train, val, and test sets
X_train[cols_to_impute] = imp_mode.fit_transform(X_train[cols_to_impute])
X_val[cols_to_impute] = imp_mode.transform(X_val[cols_to_impute])
X_test[cols_to_impute] = imp_mode.transform(X_test[cols_to_impute])
In [99]:
X_train = pd.get_dummies(data=X_train, drop_first=True)
X_val = pd.get_dummies(data=X_val, drop_first=True)
X_test = pd.get_dummies(data=X_test, drop_first=True)
In [100]:
X_train.shape
Out[100]:
(6045, 53)

Model evaluation criterion

The bank could face two type losses. If the model could not detect the attrited customer: loss of money and if the model could detect the exsisting customer as attrited it is the loss of time. So, Recall score will be used as evaluation metric for the model performance because loss of money is greater than loss of time.

In [101]:
def classification_model_performance(model, predictors, target):
    prediction = model.predict(predictors)

    accuracy = accuracy_score(target, prediction) 
    recall = recall_score(target, prediction) 
    precision = precision_score(target, prediction)
    f1 = f1_score(target, prediction) 

    # creating a dataframe of metrics
    df_performance = pd.DataFrame(
        {"Accuracy": accuracy, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_performance
In [102]:
def confusion_matrix_classification(model, predictors, target):
    y_prediction = model.predict(predictors)
    cm = confusion_matrix(target, y_prediction)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True")
    plt.xlabel("Predicted")

Logistic Regression

In [103]:
lr = LogisticRegression(random_state=1)
lr.fit(X_train, y_train)
C:\Users\kayaf\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[103]:
LogisticRegression(random_state=1)
In [104]:
Log_reg_model_performance_train= classification_model_performance(lr, X_train, y_train)
Log_reg_model_performance_train
Out[104]:
Accuracy Recall Precision F1
0 0.910 0.628 0.769 0.691
In [105]:
Log_reg_model_performance_val= classification_model_performance(lr, X_val, y_val)
Log_reg_model_performance_val
Out[105]:
Accuracy Recall Precision F1
0 0.907 0.603 0.772 0.677
In [106]:
confusion_matrix_classification(lr, X_train, y_train)
In [107]:
confusion_matrix_classification(lr, X_val, y_val)
Observation

The model is not overfitting but the recall score is so low.

Decision Tree

In [108]:
dtree=DecisionTreeClassifier(random_state=1)
dtree.fit(X_train,y_train)
Out[108]:
DecisionTreeClassifier(random_state=1)
In [109]:
d_tree_model_performance_train= classification_model_performance(dtree, X_train, y_train)
d_tree_model_performance_train
Out[109]:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
In [110]:
d_tree_model_performance_val= classification_model_performance(dtree, X_val, y_val)
d_tree_model_performance_val
Out[110]:
Accuracy Recall Precision F1
0 0.886 0.606 0.659 0.631
In [111]:
confusion_matrix_classification(dtree, X_train, y_train)
In [112]:
confusion_matrix_classification(dtree, X_val, y_val)
Observation

The model is overfitting; the model performs well on the train set but not on the validation set.

Bagging Classifier

In [113]:
bagging= BaggingClassifier(random_state=1)
bagging.fit(X_train,y_train)
Out[113]:
BaggingClassifier(random_state=1)
In [114]:
bagging_model_performance_train= classification_model_performance(bagging, X_train, y_train)
bagging_model_performance_train
Out[114]:
Accuracy Recall Precision F1
0 0.994 0.972 0.991 0.981
In [115]:
bagging_model_performance_val= classification_model_performance(bagging, X_val, y_val)
bagging_model_performance_val
Out[115]:
Accuracy Recall Precision F1
0 0.917 0.652 0.794 0.716
In [116]:
confusion_matrix_classification(bagging, X_train, y_train)
In [117]:
confusion_matrix_classification(bagging, X_val, y_val)
Observation

The model is overfitting, the recall score is very high for train set but not validation set.

Random Forest Model

In [118]:
rf=RandomForestClassifier(random_state=1)
rf.fit(X_train,y_train)
Out[118]:
RandomForestClassifier(random_state=1)
In [119]:
random_forest_model_performance_train= classification_model_performance(rf, X_train, y_train)
random_forest_model_performance_train
Out[119]:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
In [120]:
random_forest_model_performance_val= classification_model_performance(rf, X_val, y_val)
random_forest_model_performance_val
Out[120]:
Accuracy Recall Precision F1
0 0.934 0.683 0.884 0.771
In [121]:
confusion_matrix_classification(rf, X_train, y_train)
In [122]:
confusion_matrix_classification(rf, X_val, y_val)
Observation

The model is overfitting, it perform well on the train set but not on the val set, recall score is low on the val set.

Boosting

AdaBoost Classifier

In [123]:
ada = AdaBoostClassifier(random_state=1)
ada.fit(X_train,y_train)
Out[123]:
AdaBoostClassifier(random_state=1)
In [124]:
ada_boosting_model_performance_train= classification_model_performance(ada, X_train, y_train)
ada_boosting_model_performance_train
Out[124]:
Accuracy Recall Precision F1
0 0.921 0.681 0.797 0.735
In [125]:
ada_boosting_model_performance_val= classification_model_performance(ada, X_val, y_val)
ada_boosting_model_performance_val
Out[125]:
Accuracy Recall Precision F1
0 0.919 0.677 0.789 0.728
In [126]:
confusion_matrix_classification(ada, X_train, y_train)
In [127]:
confusion_matrix_classification(ada, X_val, y_val)
Observation

The model is not overfit but recall scores are so low on train and val sets.

Gradient Boosting Classifier

In [128]:
gbc = GradientBoostingClassifier(random_state=1)
gbc.fit(X_train,y_train)
Out[128]:
GradientBoostingClassifier(random_state=1)
In [129]:
gbc_model_performance_train= classification_model_performance(gbc, X_train, y_train)
gbc_model_performance_train
Out[129]:
Accuracy Recall Precision F1
0 0.940 0.741 0.868 0.799
In [130]:
gbc_model_performance_val= classification_model_performance(gbc, X_val, y_val)
gbc_model_performance_val
Out[130]:
Accuracy Recall Precision F1
0 0.935 0.717 0.857 0.781
In [131]:
confusion_matrix_classification(gbc, X_train, y_train)
In [132]:
confusion_matrix_classification(gbc, X_val, y_val)
Observation

The model is not overfitting, but the recall score on the train and val sets are low.

XGBoost Classifier

In [133]:
xgb = XGBClassifier(random_state=1, eval_metric='logloss')
xgb.fit(X_train,y_train)
C:\Users\kayaf\anaconda3\lib\site-packages\xgboost\sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
  warnings.warn(label_encoder_deprecation_msg, UserWarning)
Out[133]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              eval_metric='logloss', gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=4,
              num_parallel_tree=1, predictor='auto', random_state=1,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
In [134]:
xgb_model_performance_train= classification_model_performance(xgb, X_train, y_train)
xgb_model_performance_train
Out[134]:
Accuracy Recall Precision F1
0 1.000 0.998 1.000 0.999
In [135]:
xgb_model_performance_val= classification_model_performance(xgb, X_val, y_val)
xgb_model_performance_val
Out[135]:
Accuracy Recall Precision F1
0 0.940 0.760 0.852 0.803
In [136]:
confusion_matrix_classification(xgb, X_val, y_val)
Observation

The model perfom well on train set but not perfect on the val set. The model is overfit.

Oversampling train data using SMOTE

In [137]:
print("Before oversampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

sm = SMOTE(
    sampling_strategy=1, k_neighbors=5, random_state=1) 
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)


print("After oversampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))


print("After oversampling, the shape of train_X: {}".format(X_train_over.shape))
print("After oversampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before oversampling, counts of label 'Yes': 975
Before oversampling, counts of label 'No': 5070 

After oversampling, counts of label 'Yes': 5070
After oversampling, counts of label 'No': 5070 

After oversampling, the shape of train_X: (10140, 53)
After oversampling, the shape of train_y: (10140,) 

Logistic Regression on oversampled data

In [138]:
lr_over = LogisticRegression(random_state=1)
lr_over.fit(X_train_over, y_train_over)
C:\Users\kayaf\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[138]:
LogisticRegression(random_state=1)
In [139]:
lr_over_performance_train= classification_model_performance(lr_over, X_train_over, y_train_over)
lr_over_performance_train
Out[139]:
Accuracy Recall Precision F1
0 0.935 0.928 0.942 0.935
In [140]:
lr_over_performance_val= classification_model_performance(lr_over, X_val, y_val)
lr_over_performance_val
Out[140]:
Accuracy Recall Precision F1
0 0.894 0.634 0.687 0.659
In [141]:
confusion_matrix_classification(lr_over, X_train_over, y_train_over)
In [142]:
confusion_matrix_classification(lr_over, X_val, y_val)
Observation

The model is overfitting

Decision Tree on oversampled data

In [143]:
dtree_over=DecisionTreeClassifier(random_state=1)
dtree_over.fit(X_train_over, y_train_over)
Out[143]:
DecisionTreeClassifier(random_state=1)
In [144]:
dtree_over_performance_train= classification_model_performance(dtree_over, X_train_over, y_train_over)
dtree_over_performance_train
Out[144]:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
In [145]:
dtree_over_performance_val= classification_model_performance(dtree_over, X_val, y_val)
dtree_over_performance_val
Out[145]:
Accuracy Recall Precision F1
0 0.883 0.711 0.619 0.662
In [146]:
confusion_matrix_classification(dtree_over, X_train_over, y_train_over)
In [147]:
confusion_matrix_classification(dtree_over, X_val, y_val)
Observation

The model is overfitting

Bagging Classifier on oversampled data

In [148]:
bagging_over= BaggingClassifier(random_state=1)
bagging_over.fit(X_train_over, y_train_over)
Out[148]:
BaggingClassifier(random_state=1)
In [149]:
bagging_over_performance_train= classification_model_performance(bagging_over, X_train_over, y_train_over)
bagging_over_performance_train
Out[149]:
Accuracy Recall Precision F1
0 0.996 0.994 0.997 0.996
In [150]:
bagging_over_performance_val= classification_model_performance(bagging_over, X_val, y_val)
bagging_over_performance_val
Out[150]:
Accuracy Recall Precision F1
0 0.918 0.726 0.756 0.741
In [151]:
confusion_matrix_classification(bagging_over, X_train_over, y_train_over)
In [152]:
confusion_matrix_classification(bagging_over, X_val, y_val)
Observation

The recall score is low on val set, the model is overfitting

Random Forest on oversampled data

In [153]:
rf_over=RandomForestClassifier(random_state=1)
rf_over.fit(X_train_over, y_train_over)
Out[153]:
RandomForestClassifier(random_state=1)
In [154]:
rf_over_performance_train= classification_model_performance(rf_over, X_train_over, y_train_over)
rf_over_performance_train
Out[154]:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
In [155]:
rf_over_performance_val= classification_model_performance(rf_over, X_val, y_val)
rf_over_performance_val
Out[155]:
Accuracy Recall Precision F1
0 0.924 0.717 0.790 0.752
In [156]:
confusion_matrix_classification(rf_over, X_train_over, y_train_over)
In [157]:
confusion_matrix_classification(rf_over, X_val, y_val)
Observation

The model is overfitting

AdaBoost Classifier on oversampled data

In [158]:
ada_over = AdaBoostClassifier(random_state=1)
ada_over.fit(X_train_over, y_train_over)
Out[158]:
AdaBoostClassifier(random_state=1)
In [159]:
ada_over_performance_train= classification_model_performance(ada_over, X_train_over, y_train_over)
ada_over_performance_train
Out[159]:
Accuracy Recall Precision F1
0 0.939 0.945 0.933 0.939
In [160]:
ada_over_performance_val= classification_model_performance(ada_over, X_val, y_val)
ada_over_performance_val
Out[160]:
Accuracy Recall Precision F1
0 0.910 0.729 0.718 0.724
In [161]:
confusion_matrix_classification(ada_over, X_train_over, y_train_over)
In [162]:
confusion_matrix_classification(ada_over, X_val, y_val)

The model is overfitting

Gradient Boosting Classifier on oversampled data

In [163]:
gbc_over = GradientBoostingClassifier(random_state=1)
gbc_over.fit(X_train_over, y_train_over)
Out[163]:
GradientBoostingClassifier(random_state=1)
In [164]:
gbc_over_performance_train= classification_model_performance(gbc_over, X_train_over, y_train_over)
gbc_over_performance_train
Out[164]:
Accuracy Recall Precision F1
0 0.955 0.960 0.950 0.955
In [165]:
gbc_over_performance_val= classification_model_performance(gbc_over, X_val, y_val)
gbc_over_performance_val
Out[165]:
Accuracy Recall Precision F1
0 0.930 0.809 0.767 0.787
In [166]:
confusion_matrix_classification(gbc_over, X_train_over, y_train_over)
In [167]:
confusion_matrix_classification(gbc_over, X_val, y_val)

XGBoost Classifier on oversampled data

In [168]:
xgb_over = XGBClassifier(random_state=1, eval_metric='logloss')
xgb_over.fit(X_train_over, y_train_over)
C:\Users\kayaf\anaconda3\lib\site-packages\xgboost\sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
  warnings.warn(label_encoder_deprecation_msg, UserWarning)
Out[168]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              eval_metric='logloss', gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=4,
              num_parallel_tree=1, predictor='auto', random_state=1,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
In [169]:
xgb_over_performance_train= classification_model_performance(xgb_over, X_train_over, y_train_over)
xgb_over_performance_train
Out[169]:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
In [170]:
xgb_over_performance_val= classification_model_performance(xgb_over, X_val, y_val)
xgb_over_performance_val
Out[170]:
Accuracy Recall Precision F1
0 0.942 0.794 0.838 0.815
In [171]:
confusion_matrix_classification(xgb_over, X_train_over, y_train_over)
In [172]:
confusion_matrix_classification(xgb_over, X_val, y_val)
Observation

The oversampled models performed well on the train set but did not perform well on Val sets. Gradient Boosting Classifier on oversampled data is better than other models on oversampled data.

Undersampling train data using RandomUnderSampler

In [173]:
rus = RandomUnderSampler(random_state=1)
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)

Logistic Regression on undersampled data

In [174]:
lr_under = LogisticRegression(random_state=1)
lr_under.fit(X_train_under, y_train_under)
C:\Users\kayaf\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[174]:
LogisticRegression(random_state=1)
In [175]:
lr_under_performance_train= classification_model_performance(lr_under, X_train_under, y_train_under)
lr_under_performance_train
Out[175]:
Accuracy Recall Precision F1
0 0.874 0.879 0.870 0.874
In [176]:
lr_under_performance_val= classification_model_performance(lr_under, X_val, y_val)
lr_under_performance_val
Out[176]:
Accuracy Recall Precision F1
0 0.856 0.874 0.532 0.661
In [177]:
confusion_matrix_classification(lr_under, X_train_under, y_train_under)
In [178]:
confusion_matrix_classification(lr_under, X_val, y_val)

The recall scores on the train and val sets are good but not perfect, and accuracy scores on the train and val sets are low.

Decision Tree on undersampled data

In [179]:
dtree_under=DecisionTreeClassifier(random_state=1)
dtree_under.fit(X_train_under, y_train_under)
Out[179]:
DecisionTreeClassifier(random_state=1)
In [180]:
dtree_under_performance_train= classification_model_performance(dtree_under, X_train_under, y_train_under)
dtree_under_performance_train
Out[180]:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
In [181]:
dtree_under_performance_val= classification_model_performance(dtree_under, X_val, y_val)
dtree_under_performance_val
Out[181]:
Accuracy Recall Precision F1
0 0.839 0.825 0.500 0.623
In [182]:
confusion_matrix_classification(dtree_under, X_train_under, y_train_under)
In [183]:
confusion_matrix_classification(dtree_under, X_val, y_val)

The model is overfitting.

Bagging Classifier on undersampled data

In [184]:
bagging_under= BaggingClassifier(random_state=1)
bagging_under.fit(X_train_under, y_train_under)
Out[184]:
BaggingClassifier(random_state=1)
In [185]:
bagging_under_performance_train= classification_model_performance(bagging_under, X_train_under, y_train_under)
bagging_under_performance_train
Out[185]:
Accuracy Recall Precision F1
0 0.993 0.990 0.997 0.993
In [186]:
bagging_under_performance_val= classification_model_performance(bagging_under, X_val, y_val)
bagging_under_performance_val
Out[186]:
Accuracy Recall Precision F1
0 0.876 0.855 0.578 0.690
In [187]:
confusion_matrix_classification(bagging_under, X_train_under, y_train_under)
In [188]:
confusion_matrix_classification(bagging_under, X_val, y_val)

The recall score on the val set is low.

Random Forest on undersampled data

In [189]:
rf_under=RandomForestClassifier(random_state=1)
rf_under.fit(X_train_under, y_train_under)
Out[189]:
RandomForestClassifier(random_state=1)
In [190]:
rf_under_performance_train= classification_model_performance(rf_under, X_train_under, y_train_under)
rf_under_performance_train
Out[190]:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
In [191]:
rf_under_performance_val= classification_model_performance(rf_under, X_val, y_val)
rf_under_performance_val
Out[191]:
Accuracy Recall Precision F1
0 0.888 0.889 0.605 0.720
In [192]:
confusion_matrix_classification(rf_under, X_train_under, y_train_under)
In [193]:
confusion_matrix_classification(rf_under, X_val, y_val)

The model performs well on the train set but not well on the val set.

AdaBoost Classifier on undersampled data

In [194]:
ada_under = AdaBoostClassifier(random_state=1)
ada_under.fit(X_train_under, y_train_under)
Out[194]:
AdaBoostClassifier(random_state=1)
In [195]:
ada_under_performance_train= classification_model_performance(ada_under, X_train_under, y_train_under)
ada_under_performance_train
Out[195]:
Accuracy Recall Precision F1
0 0.889 0.889 0.888 0.889
In [196]:
ada_under_performance_val= classification_model_performance(ada_under, X_val, y_val)
ada_under_performance_val
Out[196]:
Accuracy Recall Precision F1
0 0.876 0.892 0.575 0.700
In [197]:
confusion_matrix_classification(ada_under, X_train_under, y_train_under)
In [198]:
confusion_matrix_classification(ada_under, X_val, y_val)

The recall scores are good and the model is not overfitting but the accuracy scores are low.

Gradient Boosting Classifier on undersampled data

In [199]:
gbc_under = GradientBoostingClassifier(random_state=1)
gbc_under.fit(X_train_under, y_train_under)
Out[199]:
GradientBoostingClassifier(random_state=1)
In [200]:
gbc_under_performance_train= classification_model_performance(gbc_under, X_train_under, y_train_under)
gbc_under_performance_train
Out[200]:
Accuracy Recall Precision F1
0 0.930 0.947 0.916 0.931
In [201]:
gbc_under_performance_val= classification_model_performance(gbc_under, X_val, y_val)
gbc_under_performance_val
Out[201]:
Accuracy Recall Precision F1
0 0.889 0.898 0.605 0.723
In [202]:
confusion_matrix_classification(gbc_under, X_train_under, y_train_under)
In [203]:
confusion_matrix_classification(gbc_under, X_val, y_val)

XGBoost Classifier on undersampled data

In [204]:
xgb_under = XGBClassifier(random_state=1, eval_metric='logloss')
xgb_under.fit(X_train_under, y_train_under)
C:\Users\kayaf\anaconda3\lib\site-packages\xgboost\sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
  warnings.warn(label_encoder_deprecation_msg, UserWarning)
Out[204]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              eval_metric='logloss', gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=4,
              num_parallel_tree=1, predictor='auto', random_state=1,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
In [205]:
xgb_under_performance_train= classification_model_performance(xgb_under, X_train_under, y_train_under)
xgb_under_performance_train
Out[205]:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
In [206]:
xgb_under_performance_val= classification_model_performance(xgb_under, X_val, y_val)
xgb_under_performance_val
Out[206]:
Accuracy Recall Precision F1
0 0.906 0.917 0.648 0.759
In [207]:
confusion_matrix_classification(xgb_under, X_train_under, y_train_under)
In [208]:
confusion_matrix_classification(xgb_under, X_val, y_val)

Comparing The Models Performances

In [209]:
models_comp = pd.concat( [
    Log_reg_model_performance_train.T,
    Log_reg_model_performance_val.T,
    d_tree_model_performance_train.T,
    d_tree_model_performance_val.T,
    bagging_model_performance_train.T,
    bagging_model_performance_val.T,
    random_forest_model_performance_train.T,
    random_forest_model_performance_val.T,
    ada_boosting_model_performance_train.T,
    ada_boosting_model_performance_val.T,
    gbc_model_performance_train.T,
    gbc_model_performance_val.T,
    xgb_model_performance_train.T,
    xgb_model_performance_val.T,
    lr_over_performance_train.T,
    lr_over_performance_val.T,
    dtree_over_performance_train.T,
    dtree_over_performance_val.T,
    bagging_over_performance_train.T,
    bagging_over_performance_val.T,
    rf_over_performance_train.T,
    rf_over_performance_val.T,
    ada_over_performance_train.T,
    ada_over_performance_val.T,
    gbc_over_performance_train.T,
    gbc_over_performance_val.T,
    xgb_over_performance_train.T,
    xgb_over_performance_val.T,
    lr_under_performance_train.T,
    lr_under_performance_val.T,
    dtree_under_performance_train.T,
    dtree_under_performance_val.T,
    bagging_under_performance_train.T,
    bagging_under_performance_val.T,
    rf_under_performance_train.T,
    rf_under_performance_val.T,
    ada_under_performance_train.T,
    ada_under_performance_val.T,
    gbc_under_performance_train.T,
    gbc_under_performance_val.T,
    xgb_under_performance_train.T,
    xgb_under_performance_val.T], axis=1)
    
models_comp.columns = [
    "Logistic Regression Train",
    "Logistic Regression Val",
    "Decision Tree Train",
    "Decision Tree Val",
    "Bagging Classifier Train",
    "Bagging Classifier Val",
    "Random Forest Model Train",
    "Random Forest Model Val",
    "AdaBoost Classifier Train",
    "AdaBoost Classifier Val",
    "Gradient Boosting Train",
    "Gradient Boosting Val",
    "XGBoost Classifier Train",
    "XGBoost Classifier Val",
    "Logistic Regression on oversampled Train",
    "Logistic Regression on oversampled Val",
    "Decision Tree on oversampled Train",
    "Decision Tree on oversampled Val",
    "Bagging Classifier on oversampled Train",
    "Bagging Classifier on oversampled Val",
    "Random Forest on oversampled Train",
    "Random Forest on oversampled Val",
    "AdaBoost Classifier on oversampled Train",
    "AdaBoost Classifier on oversampled Val",
    "Gradient Boosting Classifier on oversampled Train",
    "Gradient Boosting Classifier on oversampled Val",
    "XGBoost Classifier on oversampled Train",
    "XGBoost Classifier on oversampled Val",
    "Logistic Regression on undersampled Train",
    "Logistic Regression on undersampled Val",
    "Decision Tree on undersampled Train",
    "Decision Tree on undersampled Val",
    "Bagging Classifier on undersampled Train",
    "Bagging Classifier on undersampled Val",
    "Random Forest on undersampled Train",
    "Random Forest on undersampled Val",
    "AdaBoost Classifier on undersampled Train",
    "AdaBoost Classifier on undersampled Val",
    "Gradient Boosting Classifier on undersampled Train",
    "Gradient Boosting Classifier on undersampled Val", 
    "XGBoost Classifier on undersampled Train",
    "XGBoost Classifier on undersampled Val"]
In [210]:
models_comp
Out[210]:
Logistic Regression Train Logistic Regression Val Decision Tree Train Decision Tree Val Bagging Classifier Train Bagging Classifier Val Random Forest Model Train Random Forest Model Val AdaBoost Classifier Train AdaBoost Classifier Val Gradient Boosting Train Gradient Boosting Val XGBoost Classifier Train XGBoost Classifier Val Logistic Regression on oversampled Train Logistic Regression on oversampled Val Decision Tree on oversampled Train Decision Tree on oversampled Val Bagging Classifier on oversampled Train Bagging Classifier on oversampled Val Random Forest on oversampled Train Random Forest on oversampled Val AdaBoost Classifier on oversampled Train AdaBoost Classifier on oversampled Val Gradient Boosting Classifier on oversampled Train Gradient Boosting Classifier on oversampled Val XGBoost Classifier on oversampled Train XGBoost Classifier on oversampled Val Logistic Regression on undersampled Train Logistic Regression on undersampled Val Decision Tree on undersampled Train Decision Tree on undersampled Val Bagging Classifier on undersampled Train Bagging Classifier on undersampled Val Random Forest on undersampled Train Random Forest on undersampled Val AdaBoost Classifier on undersampled Train AdaBoost Classifier on undersampled Val Gradient Boosting Classifier on undersampled Train Gradient Boosting Classifier on undersampled Val XGBoost Classifier on undersampled Train XGBoost Classifier on undersampled Val
Accuracy 0.910 0.907 1.000 0.886 0.994 0.917 1.000 0.934 0.921 0.919 0.940 0.935 1.000 0.940 0.935 0.894 1.000 0.883 0.996 0.918 1.000 0.924 0.939 0.910 0.955 0.930 1.000 0.942 0.874 0.856 1.000 0.839 0.993 0.876 1.000 0.888 0.889 0.876 0.930 0.889 1.000 0.906
Recall 0.628 0.603 1.000 0.606 0.972 0.652 1.000 0.683 0.681 0.677 0.741 0.717 0.998 0.760 0.928 0.634 1.000 0.711 0.994 0.726 1.000 0.717 0.945 0.729 0.960 0.809 1.000 0.794 0.879 0.874 1.000 0.825 0.990 0.855 1.000 0.889 0.889 0.892 0.947 0.898 1.000 0.917
Precision 0.769 0.772 1.000 0.659 0.991 0.794 1.000 0.884 0.797 0.789 0.868 0.857 1.000 0.852 0.942 0.687 1.000 0.619 0.997 0.756 1.000 0.790 0.933 0.718 0.950 0.767 1.000 0.838 0.870 0.532 1.000 0.500 0.997 0.578 1.000 0.605 0.888 0.575 0.916 0.605 1.000 0.648
F1 0.691 0.677 1.000 0.631 0.981 0.716 1.000 0.771 0.735 0.728 0.799 0.781 0.999 0.803 0.935 0.659 1.000 0.662 0.996 0.741 1.000 0.752 0.939 0.724 0.955 0.787 1.000 0.815 0.874 0.661 1.000 0.623 0.993 0.690 1.000 0.720 0.889 0.700 0.931 0.723 1.000 0.759
Observation

Gradient Boosting Classifier on oversampled, XGBoost Classifier on undersampled, Gradient Boosting Classifier on undersampled models can tune to improve the performance of the models.

Gradient Boosting Classifier on undersampled data Tuned

In [211]:
model_gbc_under = GradientBoostingClassifier(random_state=1)

parameters = {
    "n_estimators": [50,100,150,200,250],
    "subsample":[0.8,0.9,1],
    "max_features":[0.7,0.8,0.9,1]
}

scorer = metrics.make_scorer(metrics.recall_score)

gbc_tuned1 =RandomizedSearchCV(estimator=model_gbc_under, param_distributions=parameters, n_iter=50, scoring=scorer, cv=5, random_state=1)
gbc_tuned1.fit(X_train_under, y_train_under)


print("Best parameters are {} with CV score={}:" .format(gbc_tuned1.best_params_,gbc_tuned1.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 250, 'max_features': 0.8} with CV score=0.914871794871795:
In [212]:
gbc_under_tuned_performance_train= classification_model_performance(gbc_tuned1, X_train_under, y_train_under)
gbc_under_tuned_performance_train
Out[212]:
Accuracy Recall Precision F1
0 0.972 0.979 0.965 0.972
In [213]:
gbc_under_tuned_performance_val= classification_model_performance(gbc_tuned1, X_val, y_val)
gbc_under_tuned_performance_val
Out[213]:
Accuracy Recall Precision F1
0 0.905 0.908 0.647 0.755

XGBoost Classifier on undersampled Data Tuned

In [214]:
model_xgb_under = XGBClassifier(random_state=1,eval_metric='logloss')

# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,200,50),
            'learning_rate':[0.01,0.1,0.2,0.05],
            'gamma':[0,1,3,5],
            'subsample':[0.8,0.9,1],
            'max_depth':np.arange(1,5,1)
          }

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
xgb_tuned1 = RandomizedSearchCV(estimator=model_xgb_under, param_distributions=param_grid, n_iter=100, scoring=scorer, cv=10, random_state=1, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
xgb_tuned1.fit(X_train_under,y_train_under)

print("Best parameters are {} with CV score={}:" .format(xgb_tuned1.best_params_,xgb_tuned1.best_score_))
Best parameters are {'subsample': 1, 'n_estimators': 50, 'max_depth': 1, 'learning_rate': 0.01, 'gamma': 5} with CV score=0.9251630549126869:
C:\Users\kayaf\anaconda3\lib\site-packages\xgboost\sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
  warnings.warn(label_encoder_deprecation_msg, UserWarning)
In [215]:
xgb_under_tuned_performance_train= classification_model_performance(xgb_tuned1, X_train_under, y_train_under)
xgb_under_tuned_performance_train
Out[215]:
Accuracy Recall Precision F1
0 0.744 0.925 0.679 0.783
In [216]:
xgb_under_tuned_performance_val= classification_model_performance(xgb_tuned1, X_val, y_val)
xgb_under_tuned_performance_val
Out[216]:
Accuracy Recall Precision F1
0 0.594 0.917 0.273 0.421

Gradient Boosting Classifier on oversampled data Tuned

In [217]:
model = GradientBoostingClassifier(random_state=1)

parameters = {
    "n_estimators": [50,100,150,200,250],
    "subsample":[0.8,0.9,1],
    "max_features":[0.7,0.8,0.9,1]
}

scorer = metrics.make_scorer(metrics.recall_score)

gbc_tuned =RandomizedSearchCV(estimator=model, param_distributions=parameters, n_iter=50, scoring=scorer, cv=5, random_state=1)
gbc_tuned.fit(X_train_over, y_train_over)


print("Best parameters are {} with CV score={}:" .format(gbc_tuned.best_params_,gbc_tuned.best_score_))
Best parameters are {'subsample': 1, 'n_estimators': 50, 'max_features': 0.8} with CV score=0.9236686390532544:
In [218]:
gbc_over_tuned_performance_train= classification_model_performance(gbc_tuned, X_train_over, y_train_over)
gbc_over_tuned_performance_train
Out[218]:
Accuracy Recall Precision F1
0 0.936 0.951 0.922 0.937
In [219]:
gbc_over_tuned_performance_val= classification_model_performance(gbc_tuned, X_val, y_val)
gbc_over_tuned_performance_val
Out[219]:
Accuracy Recall Precision F1
0 0.914 0.815 0.699 0.753

Select the Gradient Boosting Classifier on undersampled data Tuned as a final model because this model is not overfitting, has a higher recall score on the train and validation sets, and has high accuracy score. The recall score is the evaluation metric of the model, but we also look at the other metrics such as the accuracy score to decide the final model because if the recall score is so high and the accuracy score is so low that means the model does not detect the attrited and existing customers. Therefore, the Gradient Boosting Classifier on undersampled data Tuned model has high recall and high accuracy scores. Before looking at the importance of the features of the model, that let's check the model performance on the test set

In [220]:
gbc_under_tuned_performance_test= classification_model_performance(gbc_tuned1, X_test, y_test)
gbc_under_tuned_performance_test
Out[220]:
Accuracy Recall Precision F1
0 0.909 0.914 0.656 0.763
In [221]:
top_params=gbc_tuned1.best_params_
top_params
Out[221]:
{'subsample': 0.9, 'n_estimators': 250, 'max_features': 0.8}
In [222]:
feature_names = X_train.columns
gbm_model = GradientBoostingClassifier(random_state=1, subsample= 0.9, n_estimators = 250, max_features = 0.8)
gbm_model.fit(X_train_under,y_train_under)
importances=gbm_model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observation

Gradient Boosting Classifier on undersampled data Tuned is the best model among the other models, so this model can be used as a final model. The model also give a good performance on the test set. Total_Trans_Amt_bins_3000-6000, Total_Ct_Chng_Q4_Q1_arc, and Total_Revolving_Bal_bins_1000-2000 are the most important three features in the model.

Pipeline for the Model

In [223]:
numerical_features = ['Customer_Age', 'Months_on_book', 'Credit_Limit_arc',
       'Total_Amt_Chng_Q4_Q1_arc', 'Total_Ct_Chng_Q4_Q1_arc',
       'Avg_Utilization_Ratio_arc']
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
In [224]:
categorical_features = ['Gender','Dependent_count','Education_Level','Marital_Status','Income_Category','Card_Category','Total_Relationship_Count','Months_Inactive_12_mon','Contacts_Count_12_mon','Total_Trans_Ct_bins','Total_Revolving_Bal_bins','Total_Trans_Amt_bins']
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))])
In [225]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    remainder="passthrough",
)
In [226]:
X = df2.drop("Attrition_Flag", axis=1)
Y = df2["Attrition_Flag"]
In [227]:
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y
)
print(X_train.shape, X_test.shape)
(7052, 18) (3023, 18)
In [228]:
model = Pipeline(
    steps=[
        ("pre", preprocessor),
        (
            "GBC",
             GradientBoostingClassifier(
                random_state = 1,
                subsample = 0.9,
                n_estimators = 250, 
                max_features = 0.8
            ),
        ),
    ]
)
In [229]:
model.fit(X_train, y_train)
Out[229]:
Pipeline(steps=[('pre',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median'))]),
                                                  ['Customer_Age',
                                                   'Months_on_book',
                                                   'Credit_Limit_arc',
                                                   'Total_Amt_Chng_Q4_Q1_arc',
                                                   'Total_Ct_Chng_Q4_Q1_arc',
                                                   'Avg_Utilization_Ratio_arc']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_fre...
                                                  ['Gender', 'Dependent_count',
                                                   'Education_Level',
                                                   'Marital_Status',
                                                   'Income_Category',
                                                   'Card_Category',
                                                   'Total_Relationship_Count',
                                                   'Months_Inactive_12_mon',
                                                   'Contacts_Count_12_mon',
                                                   'Total_Trans_Ct_bins',
                                                   'Total_Revolving_Bal_bins',
                                                   'Total_Trans_Amt_bins'])])),
                ('GBC',
                 GradientBoostingClassifier(max_features=0.8, n_estimators=250,
                                            random_state=1, subsample=0.9))])

Conclusion and Business Recommendations

1)The best model is Gradient Boosting Classifier on undersampled data Tuned. The model has high recall scores on the train, validation, and test sets. The precision score is low, which means the model is not good to detect existing customers. 2) The value of 3000-6000 for Total Transaction Amount is the most important feature of the model. The bank should look at this feature when they detect the attrited and existing customers. 3)Education level, marital status, and gender are not very important features of the model. 4)Ratio of the total transaction count in the 4th quarter and the total transaction count in the 1st quarter is a second important feature of the model to detect attrited customers.